Annotation transfer is a principal process in genome annotation. It involves "transferring" structural and 
functional annotation to uncharacterized open reading frames (ORFs) in a newly completed genome from 
experimentally characterized proteins similar in sequence. To prevent errors in genome annotation, it is 
important that this process be robust and statistically well-characterized, especially with regard to how it 
depends on the degree of sequence similarity. Previously, we and others have analyzed annotation transfer in 
single-domain proteins. Multi-domain proteins, which make up the bulk of the ORFs in eukaryotic genomes, 
present more complex issues in functional conservation. Here we present a large-scale survey of annotation 
transfer in these proteins, using scop superfamilies to define domain folds and a thesaurus based on 
SWISS-PROT keywords to define functional categories. Our survey reveals that multi-domain proteins have 
significantly less functional conservation than single-domain ones, except when they share the exact same 
combination of domain folds. In particular, we find that for multi-domain proteins, approximate function can be 
accurately transferred with only 35% certainty for pairs of proteins sharing one structural superfamily. In 
contrast, this value is 67% for pairs of single-domain proteins sharing the same structural superfamily. On the 
other hand, if two multi-domain proteins contain the same combination of two structural superfamilies, the 
probability of their sharing the same function increases to 80%; in the case of complete coverage along the full 
length of both proteins, this value increases further to > 90%. Moreover, we found that only 70 of the current 
total of 455 structural superfamilies are found in both single and multi-domain proteins and only 14 of these 
were associated with the same function in both categories of proteins. We also investigated the degree to which 
function could be transferred between pairs of multi-domain proteins with respect to the degree of sequence 
similarity between them, finding that functional divergence at a given amount of sequence similarity is always 
about two-fold greater for pairs of multi-domain proteins (sharing similarity over a single domain) in 
comparison to pairs of single-domain ones, though the overall shape of the relationship is quite similar. Further 
information is available at http://partslist.org/func or http://bioinfo.mbb.yale.edu/partslist/func 



The ultimate goal of the genome projects is to determine the 
structure and function of all the newly identified gene prod- 
ucts. Fundamentally, this will be carried out via annotation 
transfer, transferring the structural and functional annotation 
from an experimentally characterized protein (as in a model 
organism such as Escherichia coli) to a predicted protein in a 
newly sequenced genome that shares similarity in sequence. 
The degree of annotation transferred will depend on the de- 
gree of sequence similarity. This process is shown schemati- 
cally in Figure 1. In this paper, we aim to address this major 
question in bioinformatics, specifically focusing on multi- 
domain proteins, as they make up the bulk of the proteome in 
eukaryotic organisms (Gerstein 1998). 

Our work is a direct outgrowth of two previous analyses 
of ours that concentrated on single-domain proteins. In an 
earlier paper, we found that the different structural classes of 
the scop classification system have different propensities to 
carry out certain types of function (Hegyi and Gerstein 1999). 
In particular, while the alpha/beta folds were disproportion- 
ately associated with enzymes and all-alpha and small folds 
with non-enzymes, the alpha + beta structures had an equal 
tendency for both enzymatic and non-enzymatic functions. 

1 Corresponding author. 
E-MAIL Mark.Gerstein@yale.edu 

Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.183801. 



Wilson et al. (2000) compared a large number of protein do- 
mains to one another in a pair-wise fashion with respect to 
similarities in sequence, structure, and function. Using a hy- 
brid functional classification scheme merging the ENZYME 
and FlyBase systems (Gelbart et al. 1997; Bairoch 2000), they 
found that precise function is not conserved below 30-40% 
identity, although the broad functional class is usually pre- 
served for sequence identities as low as 20-25%, given that 
the sequences have the same fold. Their survey also reinforced 
the previously established general exponential relationship 
between structural and sequence similarity (Chothia and Lesk 
1986). 

Other Work on Establishing Relationships between 
Sequence, Structure, and Function 

Several other groups have studied the relationship between 
sequence, structure, and function in detail, attempting to de- 
termine the extent to which functional transference between 
matching proteins is feasible (Shah and Hunter 1997; Martin 
et al. 1998; Thornton et al. 1999, 2000; Zhang et al. 1999; 
Shapiro and Harris 2000; Todd et al. 2001). Orengo et al. 
(1999) analyzed protein families in the CATH database and 
concluded that > 96% of the folds in the PDB are associated 
with a single homologous family. By investigating enzymatic 
folds they also found that more than 95% of homologous 
families show either single or closely related functions. 
Figure 1 Schematic illustrating annotation transfer. This figure illustrates the process of annotation transfer for a group of hypothetical TIM barrel 
proteins. The leftmost panel represents sequence comparisons between idealized barrel domains from a number of organisms. The next panel 
shows analogous results for structural comparison, and the panel after that, functional comparison. The rightmost panel represents sequence 
comparisons between idealized multi-domain proteins that match over a single domain, the subject of much of this paper. 



Pawlowski et al. (2000) studied the relationship between se- 
quence and functional similarity in the twilight zone of 10%- 
15% sequence similarity and found a clear correlation be- 
tween the two, with functional similarity based on the E.C. 
classification of enzymes. 

Russell et al. (1997) analyzed binding sites in proteins 
with similar 3D structures and estimated that 90% of new 
remote homologs have common binding sites and similar 
functions. Eisenstein et al. (2000) evaluated the first results 
from the structural genomics projects and found that in many 
instances the protein structure itself offers an important clue 
to its biological function. Stawiski et al. (2000) found that 
function could be predicted rather successfully for just the 
proteases. Devos and Valencia (2000) presented a critical view 
of function transference between similar sequences, high- 
lighting the limitations of this process due to errors in data- 
bases and the inherent complexity of the relationship be- 
tween protein sequence-structure and function that does not 
allow "simplistic interpretations." They also found that bind- 
ing sites are the least conserved features between related pro- 
teins while the catalytic activity of enzymes is the most con- 
served one. 

Multi-Domain Proteins with Divergent Functions: 
How Common? 

Most of these previous investigations focused on single- 
domain proteins or did not distinguish between single- and 
multi-domain ones. It is not clear how the multi-domain pro- 
teins with various functions behave with respect to functional 
conservation; namely, whether they are more or less con- 
served than their single-domain counterparts. In particular, as 
shown in Figure 1, if one multi-domain protein shares a single 
domain fold with another one, it is not clear the degree to 
which the functional conservation of these proteins is con- 
strained by the shared part, and to what degree it is influenced 
by other domains that are not shared. 

Specific groups of proteins that have the same combina- 
tion of structural domains but dramatically different func- 
tions illustrate this situation. One example is the combination 



of the SH3-domain (scop superfamily identifier 2.24.2) and 
the P-loop containing NTP hydrolase (3.29.1). While in 
higher organisms this combination is associated with 
presynaptic and tumor suppressor functions (SWISS-PROT names 
SP02_HUMAN and DLG1_DROME, respectively), in the lower 
eukaryote Dictyostelium it was found in myosin (MYSP_DICDI). 
Another example is the combination of the FAD/NAD(P)-binding 
superfamily and FAD-linked reductases C-terminal 
superfamily (3.4.1 and 4.12.1 superfamilies, respectively). In 
one group of proteins they appear in enzymes of the 
oxidoreductase group (e.g. OXDA_CAEEL or PHHY_PSEAE), while 
in another they are found in a dissociation inhibitor (e.g. 
GDIA_HUMAN). It should be noted that the proteins are not 
covered completely by the structural matches, so it is quite 
possible that the rest of them contain totally different do- 
mains that are responsible for the dramatically different func- 
tions. However, do these two examples show a rather rare or 
a more frequent phenomenon? How often do multi-domain 
proteins, sharing the same structural domain composition, 
differ in their functions? 

In this paper, we attempt to provide a comprehensive 
answer to this question. This is particularly timely given that 
most of the unknown proteins in eukaryotic genomes are 
multi-domain. We use the same approach as in our previous 
analyses, comparing the sequences of the structural domains 
in scop to those of SWISS-PROT using blastp. We focus on 
the functional divergence of single and multi-domain pro- 
teins, extending previous investigations of single-domain 
proteins. Also, in comparison to previous work, we focus 
more on non-enzymatic functions and scop structural super- 
families, instead of folds. 

RESULTS 

Our Approach to Functional 
and Structural Assignment 

We used the blastp program (version 2.0) (Altschul et al. 
1997) to identify the scop 1.39 (Murzin et al. 1995) structural 
domains in SWISS-PROT (version 37) (Bairoch and Apweiler 
2000) with an e-value threshold of 10^-4. We removed the hypothetical and 
fragment proteins. This resulted in two sets of proteins. 

Single-Domain 

Of the single-domain matches, only those that were almost 
completely covered with a match to a single structural do- 
main were selected. (The maximum number of uncovered 
residues was set at 70 with an additional condition that a 
maximum of 40 residues on the N-terminal end and 30 resi- 
dues on the C-terminus were allowed to be uncovered.) These 
criteria resulted in 1818 single-domain proteins being selected 
from SWISS-PROT. 
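
The coverage criteria above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds (70, 40, and 30 residues); the class and function names are hypothetical, and the assumption that match coordinates are 1-based and inclusive is ours, not stated in the text.

```python
from dataclasses import dataclass

# Thresholds quoted in the text.
MAX_UNCOVERED_TOTAL = 70
MAX_UNCOVERED_N_TERM = 40
MAX_UNCOVERED_C_TERM = 30


@dataclass(frozen=True)
class DomainMatch:
    """Hypothetical record of one structural-domain match on a protein."""
    start: int  # first matched residue (assumed 1-based)
    end: int    # last matched residue (assumed inclusive)


def is_almost_fully_covered(protein_length: int, match: DomainMatch) -> bool:
    """Return True if a single match leaves few enough residues uncovered."""
    n_uncovered = match.start - 1
    c_uncovered = protein_length - match.end
    return (
        n_uncovered <= MAX_UNCOVERED_N_TERM
        and c_uncovered <= MAX_UNCOVERED_C_TERM
        and n_uncovered + c_uncovered <= MAX_UNCOVERED_TOTAL
    )
```

With the two terminal limits in place the total limit is satisfied automatically; it is kept here only to mirror the wording of the criteria.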

Multi-Domain 

We selected 4763 multi-domain proteins from SWISS-PROT. 
All of these matched (in different locations) at least two do- 
mains of known structure belonging to different scop super- 
families (see schematic in Figure 1). We also selected a subset 
of these proteins that have almost their entire length covered 
by matches with structural domains (allowing again a maxi- 
mum of 70 uncovered residues). This selection resulted in 
2829 proteins being selected from SWISS-PROT. (In all cases, 
duplicate matches were removed, i.e., a protein at a certain 
location matches only one structural domain.) 
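
A minimal sketch of the multi-domain selection follows. The text does not say how duplicate matches at one location were resolved, so this example assumes that overlapping matches count as the same location and that the match with the lowest e-value is kept; the `Match` type and both function names are hypothetical.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    """Hypothetical record of one blastp match to a scop domain."""
    start: int        # first matched residue in the protein
    end: int          # last matched residue (inclusive)
    superfamily: str  # scop superfamily identifier, e.g. "3.29.1"
    e_value: float


def remove_duplicate_matches(matches: List[Match]) -> List[Match]:
    """Keep one match per location (assumption: best e-value wins on overlap)."""
    kept: List[Match] = []
    for m in sorted(matches, key=lambda x: x.e_value):
        overlaps = any(m.start <= k.end and k.start <= m.end for k in kept)
        if not overlaps:
            kept.append(m)
    return sorted(kept, key=lambda x: x.start)


def is_multi_domain(matches: List[Match]) -> bool:
    """True if the protein matches at least two different superfamilies."""
    superfamilies = {m.superfamily for m in remove_duplicate_matches(matches)}
    return len(superfamilies) >= 2
```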

We set out to compare these two sets of proteins for 
functional divergence. As previously, we divided functions 
into enzyme and non-enzyme (Hegyi and Gerstein 1999). En- 
zymatic functions were classified by the EC system (Bairoch 
2000). Comparisons of enzymatic functions were treated the 
same way as in our earlier analyses, that is, if they differ in the 
first three components of their respective EC numbers, they 
were considered different. This implied that our analysis dealt 
with a total of 112 enzymatic functions. Non-enzymatic func- 
tions were classified into 508 different categories based on a 
simple thesaurus we assembled of synonymous keywords 
drawn from SWISS-PROT description lines. In addition, we 
created 49 categories for functions that have an enzymatic 
component but which are not part of the EC system. This gave 
us a total of 669 functions (112 + 508 + 49). (The list of all the 
functional categories is described further in Table 2 below, 
and also can be found on the Web at 
http://bioinfo.mbb.yale.edu/partslist/func or http://partslist.org/func.) 
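
The rule for comparing enzymatic functions can be illustrated as follows. This is a sketch under the assumption that EC numbers are given as dot-separated strings such as "3.5.1.5"; the function names are hypothetical, and incomplete components (written "-") are compared literally.

```python
def ec_class(ec_number: str, levels: int = 3) -> tuple:
    """Return the first `levels` components of an EC number as a tuple."""
    return tuple(ec_number.strip().split(".")[:levels])


def same_enzymatic_function(ec_a: str, ec_b: str) -> bool:
    """Two enzymes count as the same function if their first three
    EC components agree; the fourth component is ignored."""
    return ec_class(ec_a) == ec_class(ec_b)
```

For example, two ureases annotated 3.5.1.5 and an amidase annotated 3.5.1.14 would all fall into the same category (3.5.1), whereas 1.8.1.4 and 1.8.2.- would not.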

Overall Distribution of the Matches 

Figure 2 shows the most commonly observed multi-domain 
combinations in a set of recently sequenced genomes. The 
occurrences of further combinations are available from the 
Web site. Clearly, the distribution is very skewed, with certain 
combinations, such as 3.29-2.32, and 2.29-4.61 tending to 
predominate. 

Figure 3 shows the overall distribution of the single- 
domain and multi-domain matches in the different structural 
classes. The distribution of matches between enzymes and 
non-enzymes in multi-domain proteins largely agrees with 
that in the single-domain proteins. The multi-domain 
matches follow the overall tendency of the alpha/beta folds to 
be associated with enzymes to a larger extent and the all- 
alpha and small folds with non-enzymes. However, the values 
for the multi-domain matches are generally less extreme than 
for single-domains; for example, the 10-fold difference be- 
tween single-domain alpha/beta enzymes and non-enzymes 
decreases to about twofold in multi-domain proteins. Another 
significant difference is the reduction in the number of 
multi-domain non-enzymes in the all-beta and alpha + beta 
structural classes compared to the single-domain matches. 
Altogether, there are more enzymes than non-enzymes among the 
multi-domain proteins (2805 enzymes vs. 1958 non-enzymes), 
whereas for single-domain proteins the opposite is true (850 
enzymes vs. 968 non-enzymes). 

Figure 2 Distribution of multi-domain combinations amongst the 
genomes. The figure shows the occurrence of multi-domain fold 
combinations in a number of genomes, indicating their great variability. 
Each row indicates a particular combination of scop fold pairs (using 
scop 1.39), where a fold pair is defined as two distinct folds occurring 
in tandem in a protein. Each column represents a different genome, 
using the four-letter codes in the PartsList system (Qian et al. 2001): 
Aaeo, Aquifex aeolicus; Aful, Archaeoglobus fulgidus; Bbur, Borrelia 
burgdorferi; Bsub, Bacillus subtilis; Cele, Caenorhabditis elegans; Cpne, 
Chlamydia pneumoniae; Ctra, Chlamydia trachomatis; Ecol, Escherichia 
coli; Hinf, Haemophilus influenzae Rd; Hpyl, Helicobacter pylori; Mthe, 
Methanobacterium thermoautotrophicum; Mjan, Methanococcus 
jannaschii; Mtub, Mycobacterium tuberculosis; Mgen, Mycoplasma 
genitalium; Mpne, Mycoplasma pneumoniae; Phor, Pyrococcus horikoshii; 
Rpro, Rickettsia prowazekii; Scer, Saccharomyces cerevisiae; Syne, 
Synechocystis sp.; Tpal, Treponema pallidum. The numbers in each 
intersection cell indicate the number of times the fold pairs occur in a 
genome. Only the 20 most common fold pair combinations are 
shown here; the remainder are shown on the Web site 
(http://partslist.org/func). If the value in a cell is greater than 6, the 
cell is shaded black; between 3 and 6, gray; and below 3, white. The 
blank spaces show instances in which one of the pairs does not occur in 
the organism at all (indicated by a value of -1 in the data table on the 
Web site). The fold assignments are done in a fashion consistent with 
those in PartsList and associated systems (Gerstein 1997; Lin et al. 2000; 
Drawid et al. 2001; Harrison et al. 2001; Qian et al. 2001). 

Table 1 summarizes the distribution of superfamilies and 
superfamily combinations among the major functional 
classes, i.e. whether they have only enzymatic, only non- 
enzymatic or both enzymatic and non-enzymatic functional- 
ity. Altogether, 215 superfamilies were found in single-domain 
proteins and 310 in multi-domain ones. As 70 superfamilies 
were found in both, altogether 455 distinct structural super- 
families matched a SWISS-PROT protein with our required 
coverage criteria (described above). Similarly, we apportioned 
the 281 superfamily combinations observed in multi-domain 
proteins amongst different broad functional categories. 

In single-domain proteins there are about as many su- 
perfamilies with exclusively enzymatic functionality as there 
are those with exclusively non-enzymatic functions (82 vs. 
78). In contrast, in multi-domain proteins this ratio increases 
Figure 3 Distribution of proteins amongst broad structural and 
functional classes; the distribution of the matches among the seven 
structural and two functional classes in single- and multi-domain pro- 
teins. The single-domain and multi-domain matches each total 
100%, independently of each other. The horizontal axis indicates the 
seven scop classes, which are (from 1 to 7): all-alpha, all-beta, alpha/ 
beta, alpha + beta, multi-domain, membrane, and small protein. 



to almost threefold (135 vs. 56). This agrees with the notion 
that most enzymes are multi-domain. Another difference be- 
tween single and multi-domain proteins appears in the ratio 
of superfamilies with a single function compared to multi- 
functional ones. As is apparent from Table 1, about a quar- 
ter of the superfamilies matched single-domain proteins with 
different functions (55 of 215), whereas in the multi-domain 
proteins, this ratio increased to more than a third (119 of 310). 
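
The counts quoted from Table 1 in the preceding paragraphs can be checked with a few lines of arithmetic. This is only a bookkeeping sketch; the variable names are ours.

```python
# Superfamily counts quoted in the text (Table 1 totals).
single_domain = {"single_function": 160, "multiple_function": 55}
multi_domain = {"single_function": 191, "multiple_function": 119}
shared_superfamilies = 70  # found in both single- and multi-domain proteins

n_single = sum(single_domain.values())               # 215
n_multi = sum(multi_domain.values())                 # 310
n_distinct = n_single + n_multi - shared_superfamilies  # 455

# Fraction of superfamilies associated with more than one function.
frac_multifunctional_single = single_domain["multiple_function"] / n_single
frac_multifunctional_multi = multi_domain["multiple_function"] / n_multi

# Ratio of exclusively enzymatic to exclusively non-enzymatic superfamilies.
ratio_single = 82 / 78
ratio_multi = 135 / 56
```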

Single-Domain Proteins 

Table 2 lists the two functionally most diverse structural su- 
perfamilies in single-domain proteins with some representa- 
tive functions. The most diverse superfamily, the 3.38.1 
Thioredoxin-like, has 11 different functions associated with 
it, most of them with an oxidoreductase mechanism. For in- 
stance, THIO_BPT4 is a small disulphide-containing thiore- 
doxin that serves as a general disulphide oxidoreductase, 



while TDX2_BRUMA is almost twice as long (199 aa) and 
serves as a thiol-specific antioxidant that acts against sulfur- 
containing radicals. Another interesting example of func- 
tional diversity is provided by the Scorpion toxin-like super- 
family (7.3.6). While BRAZ_PENBA is a small protein that is 
known to be 2000 times sweeter than sucrose, the other mem- 
bers of the superfamily are associated with different host- 
defense mechanisms. In insects the superfamily possesses 
antifungal activity (DMYC_DROME) or acts as a toxin 
(SCX5_BUTEU). Interestingly, in plants it can also act as an 
antifungal (AF2B_SINAL) or as an inhibitor of insect alpha- 
amylases (SIA1_SORBI). It appears that many single-domain 
proteins are toxins or allergens, or are related in other ways to 
a host-defense response. 

Based on the data we can also determine the probability 
of two single-domain proteins that match domains in the 
same superfamily category also carrying out the same func- 
tion. Using Bayes' theorem: 

\[ P(F \mid S) = \frac{P(F)\,P(S \mid F)}{P(F)\,P(S \mid F) + P(\neg F)\,P(S \mid \neg F)} \tag{1} \]

where S is the event that two proteins share the same 
superfamily, F is the event that two proteins have the 
same function, and \(\neg F\) is the event that two proteins do 
not have the same function. Rearranging and simplifying the 
equation we get: 

\[ P(F \mid S) = \frac{1}{1 + N(S, \neg F)/N(S, F)} \tag{2} \]

where N is the number of times that the two events in the 
parentheses occur together in our database of 1818 single- 
domain proteins. This results in 

\[ P(F \mid S) = \frac{1}{1 + 8501/12516} = 68\%. \]

That is, the probability that two single-domain proteins that 
have the same superfamily structure have the same function 
(whether enzymatic or not) is about 2/3. 
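
The step from Equation 1 to Equation 2 can be made explicit. The following is a sketch that assumes the probabilities are estimated as relative frequencies over protein pairs; \(N_{\mathrm{tot}}\), the total number of pairs, is a symbol introduced here for illustration and cancels out.

```latex
\begin{align*}
P(F \mid S)
  &= \frac{P(F)\,P(S \mid F)}{P(F)\,P(S \mid F) + P(\neg F)\,P(S \mid \neg F)}
   = \frac{P(S, F)}{P(S, F) + P(S, \neg F)} \\
  &= \frac{N(S, F)/N_{\mathrm{tot}}}
          {N(S, F)/N_{\mathrm{tot}} + N(S, \neg F)/N_{\mathrm{tot}}}
   = \frac{1}{1 + N(S, \neg F)/N(S, F)}
\end{align*}
```

In other words, the probability reduces to the fraction of same-superfamily pairs that also share a function.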

Multi-Domain Proteins 

Table 3 lists the combinations of superfamilies that have been 
associated with the greatest number of different functions in 
multi-domain proteins, with representative entries in SWISS- 
PROT. The combination with the greatest number of different 
functions is that of 1.95.1 and 7.33.1. Although it has twice as 
many different functions as the most diverse superfamily in 



Table 1. Functional Distribution of Single-domain, Multi-domain Superfamilies, and 
Multi-domain Combinations 

                   Single-domain           Multi-domain            Multi-domain sfam
                   superfamilies           superfamilies           combinations
                   Single     Multiple     Single     Multiple     Single     Multiple
                   function   function     function   function     function   function

Enzymatic             82         11          135         42          151         16
Nonenzymatic          78         23           56         30           70         27
Both functions         -         15            -         47            -         17
Total                160         55          191        119          221         60

The basic functional distribution of the superfamilies in single- and multi-domain proteins and the 
functional distribution of multi-domain combinations are shown. The first row lists the number of 
scop superfamilies that were associated only with enzymatic function in each category. The second 
row lists the number associated with only nonenzymatic functions, and the third row indicates the 
number of superfamilies that were associated with both types of function. Altogether, we 
characterized 160 + 55 = 215 single-domain and 191 + 119 = 310 multi-domain superfamilies, 70 of 
which overlapped in the two categories. 
Table 2. Most Versatile Single-Domain Superfamilies 

No.    No.    Sfam
func   prot   comb     Function       SWISS-PROT ID   SWISS-PROT function

11     69     3.38.1   E1.11.1        GSHP_RAT        Plasma Glutathione Peroxidase (1.11.1.9)
                       263#           DYL5_CHLRE      Dynein, Flagellar Outer Arm - C. reinhardtii
                       D260#          BSAA_BACSU      Glutathione Peroxidase Homolog Bsaa
                       268#           REHY_TORRU      Rehydrin - Tortula ruralis (Moss)
                       266#           PHOS_HUMAN      Phosducin (33 Kd Phototransducing Protein)
                       269#           REHY_ORYSA      Rad24 Protein - Oryza sativa (Rice)
                       272#           THIO_BPT4       Thioredoxin (Bacteriophage T4)
                       D27T#272#      TDX2_BRUMA      Thioredoxin Peroxidase 2
                       261#           BTUE_ECOLI      Vitamin B12 Transport Periplasmic Protein Btue

10     28     7.3.6    342#           BRAZ_PENBA      Brazzein - Pentadiplandra brazzeana
                       376#336#       SCKK_TITSE      Neurotoxin Ts-Kapa (Tsk) (Brazilian scorpion)
                       341#356#       AF2B_SINAL      Cysteine-Rich Antifungal Protein 2b (Afp2b)
                       343#           DEFA_ZOPAT      Defensin, Isoforms B And C - Zophobas atratus
                       361#           DMYC_DROME      Drosomycin Precursor (Cysteine-Rich Peptide)
                       361#376#       SCX5_BUTEU      Insectotoxin I5a (Lesser Asian scorpion)
                       336#           SCX3_LEIQH      Leiuropeptide III (Scorpion)
                       203#           SIA1_SORBI      Small-Pr Inhibitor Of Insect Alpha-Amylases

       34     4.79.3   310#           AB18_PEA        Aba-Responsive Protein Abr18 - Garden Pea
                       311#           DRR3_PEA        Disease Resistance Response Protein Pi49
                       231#           MPAA_CORAV      Major Pollen Allergen Cor A 1 - Eu. Hazel
                       312#           L18B_LUPLU      Protein Llr18b (Llpr10.1b)
                       E3.1.-         RNS2_PANGI      Ribonuclease 2 (3.1.-.-) - Panax Ginseng
                       314#           SAM2_SOYBN      Stress-Induced Protein Sam22

       43     1.26.1   184#           CSF2_SHEEP      Colony-Stimulating Factor
                       381#564#184#   IL4_RAT         Interleukin-4 (B-Cell Igg Diff. Factor)
                       185#           LIF_HUMAN       Leukemia Inhibitory Factor (Lif)
                       187#           PRL_ANGAN       Prolactin Precursor (Prl)
                       186#           PLF3_MOUSE      Proliferin 3 Mitogen-Regulated
                       188#           SOMA_PAROL      Somatotropin (Growth Hormone)

The most versatile superfamilies in single-domain proteins as determined from their functional description in 
SWISS-PROT, with some representatives. The keyword combinations in the fourth column were based either on the first three 
components of their EC numbers (for enzymes) or derived automatically by comparing the DE description line of 
SWISS-PROT entries to a list of synonymous keywords at http://bioinfo.mbb.yale.edu/partslist/func. A keyword 
number starting with a D indicates an enzyme that does not have an assigned EC number in its description in SWISS-PROT. 



the single-domain proteins (22 vs. 11, respectively), careful 
examination reveals that all the proteins in this category are 
DNA-binding and most of them act as hormone receptors. 

The second entry listed in the table is the combination of 
the 3.4.1 and 4.48.1 superfamilies associated with the FAD/ 
NAD(P)-linked reductases. It is an all-enzymatic combination 
and always carries out an oxido-reductase function. All the 
proteins in this category are completely covered by matches 
with these two superfamilies. The 1.78.1-2.1.1 hemocyanin- 
immunoglobulin combination seems also to be fairly con- 
served; although the proteins in this category are called by 
eight different names, most of them turn out to be extracel- 
lular larval storage proteins, except for the copper-containing 
oxygen carrier hemocyanin itself (HCY_PALVU). 

Following the same logic, we can also determine the 
probability that two proteins that have the same superfamily 
combination share the same function, viz: 

\[ P(F \mid S) = \frac{1}{1 + 32242/134230} = 81\% \]

This means that we have significantly greater certainty in de- 
termining the function of a multi-domain protein with a par- 
ticular superfamily combination than that of a single-domain 
protein containing a particular superfamily. We also deter- 
mined a similar probability for those proteins that have an 



almost complete coverage with exactly the same type and 
number of superfamilies, following each other in the same 
order. The probability that the functions are the same in this 
case was 91%, a considerably higher value than above. How- 
ever, if two multi-domain proteins share only a single super- 
family, the probability that they share the same function 
drops to only 35%! This greater functional certainty from 
sharing a combination of superfamilies rather than just one is 
also reflected in Table 1. While one-fourth of the single- 
domain proteins and one-third of singularly matching super- 
families in multi-domain proteins have multiple functions, 
only about one-fifth of the multi-domain combinations pos- 
sess multiple functions (60 of 281). It is also clear from the 
data that domains in larger proteins often lose their original 
function and no longer have an autonomous function. 
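
The pair-counting estimate used in this section can be sketched in Python. The text does not describe how the pairs were enumerated, so this example assumes all unordered pairs of proteins that share a structural key (a superfamily or a superfamily combination); the function names and the toy input format are hypothetical.

```python
from itertools import combinations
from typing import Iterable, Tuple


def count_pairs(proteins: Iterable[Tuple[str, str]]) -> Tuple[int, int]:
    """Count protein pairs that share a structural key.

    `proteins` yields (structure_key, function_label) tuples.
    Returns (pairs with same function, pairs with different function).
    """
    by_structure = {}
    for structure, function in proteins:
        by_structure.setdefault(structure, []).append(function)

    n_same = n_diff = 0
    for functions in by_structure.values():
        for f_a, f_b in combinations(functions, 2):
            if f_a == f_b:
                n_same += 1
            else:
                n_diff += 1
    return n_same, n_diff


def prob_same_function(n_same: int, n_diff: int) -> float:
    """P(F|S) = 1 / (1 + N(S, not F) / N(S, F))."""
    if n_same == 0:
        return 0.0
    return 1.0 / (1.0 + n_diff / n_same)
```

Applied to the pair counts quoted above for shared superfamily combinations, `prob_same_function(134230, 32242)` evaluates to roughly 0.81.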

Seventy Common Superfamilies and Their 
Functions Compared in Single-Domain 
and Multi-Domain Proteins 

As mentioned above, of the 455 superfamilies in our analysis, 
only 70 occur in both single- and multi-domain proteins. 
Even more surprising is the small number of structural super- 
families (14) that have the same function in both single- and 
Table 3. Most Versatile Superfamily Combinations in Multi-Domain Proteins 

No.    No.    Sfam
func   prot   comb.           Function      SWISS-PROT ID   SWISS-PROT function

22     176    1.95.1/7.33.1   29#           THB_RANCA       Thyroid Hormone Receptor Beta
                              10#           HNF4_DROME      Transcription Factor HNF-4 Homolog
                              31#32#        EAR2_MOUSE      V-Erba Related Protein Ear-2
                              29#30#        ECR_MANSE       Ecdysone Receptor (Ecdysteroid Receptor)
                              32#           ERBA_AVIER      Erba Oncogene Protein
                              556#564#35#   NGFI_XENLA      Nerve Growth Factor Induced Protein I-B
                              576#          NR42_HUMAN      Immediate-Early Response Protein Not
                              36#           PPAT_HUMAN      Peroxisome Proliferator Activated Receptor
                              37#           RXTG_CHICK      Retinoic Acid Receptor RXR-Gamma
                              38#           TLL_DROVI       Tailless Protein

8      54     3.4.1/4.48.1    E1.8.2        DHSU_CHRVI      Sulfide Dehydrogenase (1.8.2.-)
                              E1.8.1        DLDH_ZYMMO      Dihydrolipoamide Dehydrogenase (1.8.1.4)
                              E1.6.4        TYTR_TRYCR      Trypanothione Reductase (1.6.4.8) (Tr)
                              E1.16.1       MERA_STRLI      Mercuric Reductase (1.16.1.1)
                              E1.6.99       NAOX_MYCPN      Probable NADH Oxidase (1.6.99.3) (Noxase)

8      23     1.78.1/2.1.1    19#           ARYB_MANSE      Arylphorin Beta Subunit (Tobacco Hornworm)
                              20#           CRPI_PERAM      Allergen Cr-Pi Precursor (American Cockroach)
                              21#427#       HCY_PALVU       Hemocyanin (European Spiny Lobster)
                              22#           HEXA_BLADI      Hexamerin Precursor (Tropical Cockroach)
                              23#           JSP1_TRINI      Acidic Juvenile Hormone-Suppressible Protein
                              24#           LSP2_DROME      Larval Serum Protein 2 Precursor (LSP-2)
                              546#25#       SSP1_BOMMO      Sex-Specific Storage-Protein 1

Note that the combination with the greatest number of different functions is that of 1.95.1 and 7.33.1. Careful 
examination reveals that all the proteins with this combination are DNA-binding and most of them act as various 
hormone receptors. In particular, HNF4_DROME and NR42_HUMAN also have transcription activator functions. Note 
that these two proteins are considerably longer than the others in this group and are not covered completely by 
structural matches: A large C-terminal and a large N-terminal portion are left uncovered, respectively. 



multi-domain proteins. These are listed in Table 4; 12 of them 
have enzymatic function, supporting the notion that en- 
zymes are more conserved during evolution than non- 
enzymes. The two non-enzymatic superfamilies are the 4.29.1 
ribosomal superfamily and the 5.4.1 superfamily in penicillin- 
binding proteins. 

Table 5 presents several examples of the converse situa- 
tion, shared superfamilies that have different functions in 
single and multi-domain proteins. Comparing parts A and B 
of the table highlights the fact that although both superfami- 



lies in a multi-domain protein are often present in single- 
domain form as well, the functions in the different settings 
are only vaguely related. One example is the combination of 
the lipocalin superfamily (2.45.1) with that of the BPTI-like or 
Kunitz inhibitor (7.7.1), which in higher organisms forms a 
complex protein called alpha-1-microglobulin (AMBP_RAT). 
Another interesting example is the combination of the 2.5.1 
Cupredoxin (occurring in the single-domain blue-copper pro- 
tein, SOXE_SULAC) and the 6.5.1 Membrane all-alpha 
(single-domain representative: BACT_HALVA, a sensory rho- 



Table 4. Superfamilies With the Same Function in Single- and Multi-Domain Proteins as Determined from Their Keyword 
Combination or First Three Components of Their EC Numbers 

                    Single-domain proteins                                          Multi-domain proteins
                    SWISS-PROT                                                      SWISS-PROT
Sfam     Function   ID           SWISS-PROT function                                ID           SWISS-PROT function

1.81.1   E3.2.1     GUNY_ERWCH   Endoglucanase (3.2.1.4)                            AMYG_NEUCR   Glucoamylase Precursor (3.2.1.3)
2.66.2   E3.5.1     URE2_YERPS   Urease Beta (3.5.1.5)                              URE1_HELPY   Urease Alpha Subunit (3.5.1.5)
3.17.2   E6.3.5     NADE_MYCPN   NAD(+) Synthetase (6.3.5.1)                        GUAA_YEAST   GMP Synthase (6.3.5.2)
3.37.1   E3.1.3     PTP2_NPVOP   Protein-Tyrosine Phosphatase 2 (3.1.3.48)          PTNB_RAT     Protein-Tyrosine Phosphatase (3.1.3.48)
3.67.1   E4.2.1     TRPB_VIBPA   Tryptophan Synthase (4.2.1.20)                     TRP_YEAST    Tryptophan Synthase (4.2.1.20)
4.19.1   E5.2.1     FKB1_METJA   Peptidylprolyl Cis-Trans Isomerase (5.2.1.8)       FKB7_WHEAT   70 Kd Peptidylprolyl Isomerase (5.2.1.8)
4.2.1    E3.2.1     LYCV_BPP2    Lysozyme (3.2.1.17)                                CH1X_PEA     Endochitinase Precursor (3.2.1.14)
4.29.1   85#        RS5_ACYKS    30s Ribosomal Protein S5                           RS5_TREPA    30s Ribosomal Protein S5
4.52.1   E3.4.24    SNPA_STRCS   Extracellular Neutral Protease (3.4.24.-)          BMPH_STRPU   Collagenase 3 Precursor (3.4.24.-)
4.6.1    E3.5.1     URE3_YERPS   Urease Gamma (3.5.1.5)                             URE1_HELPY   Urease Alpha Subunit (3.5.1.5)
5.10.1   E2.7.7     KANU_STAAU   Kanamycin Nucleotidyltransferase (2.7.7.-)         DPOB_XENLA   Dna Polymerase Beta (2.7.7.7)
5.4.1    161#       AMPH_ECOLI   Penicillin-binding Protein Amph                    PBPX_STRPN   Penicillin-binding Protein 3x Pbp2x
Table 5. Examples of Superfamilies Present in Both Single- and Multi-Domain Proteins, 
Carrying out Different Functions 

Table 5A. Single-Domain Proteins 

Sfam      Funct #        SWISS-PROT ID   SWISS-PROT function

1.25.1    352#           FTN2_HAEIN      Ferritin-like Protein 2
          183#           NIGY_DESVH      Nigerythrin
          E1.17.4        RIR4_YEAST      (Ribonucleotide Reductase) (1.17.4.1)
          192#           NLP_HAEIN       Ner-like Protein Homolog

1.4.3     196#           H1A_PLADU       Histone H1A, Sperm

1.81.2    E2.5.1         PFTB_PEA        Farnesyltransferase Beta Su (2.5.1.-)

2.45.1    227#           ERBP_RAT        Epididymal-Retinoic Acid Binding Protein
          228#412# [?]   FAB3_CAEEL      Fatty Acid-Binding Protein Homolog 3
          229# [?]       NCAL_MOUSE      Neutrophil Gelatinase-Assoc. Lipocalin
          [?]            NP4_RHOPR       Nitrophorin 4 Precursor
          E5.3.99        PGHD_HUMAN      Prostaglandin D-Isomerase (5.3.99.2)
          230#421#       VNS1_MOUSE      Vomeronasal Secretory Protein 1

2.5.1     231#           MPA3_AMBEL      Pollen Allergen AMB A 3 (AMB A III)
          232#427#       SOXE_SULAC      Sulfocyanin (Blue Copper Protein)

3.14.2    373#           RRF1_DESVH      Rrf1 Protein

3.29.1    E6.3.4         PURA_CAEEL      Adenylosuccinate Synthetase (6.3.4.4)
          E2.7.4         KTHY_YEAST      Thymidylate Kinase (2.7.4.9)
          D259#          VA57_VACCV      Guanylate Kinase Homolog
          E2.7.1         KITH_VZVW       Thymidine Kinase (2.7.1.21)

3.47.1    275#           MBL_BACSU       MBL Protein
          276#           MREB_BACSU      Rod Shape-determining Protein Mreb

3.48.1    E3.1.3         PPA5_YEAST      Repressible Acid Phosphatase (3.1.3.2)

3.81.1    D281#          AMIC_PSEAE      Aliphatic Amidase Expression-Regulator
          282#           LUXP_VIBHA      LUXP Protein Precursor

4.103.1   E2.4.2         TOX1_BORPE      Pertussis Toxin Su 1 (2.4.2.-)

4.105.1   291#           LECC_POLMI      Lectin - Polyandrocarpa misakiensis

4.11.5    295#           TERP_PSESP      Terpredoxin

4.19.1    E5.2.1         FKB1_METJA      Pept-Prolyl Cis-Trans Isomerase (5.2.1.8)

6.5.1     E3.6.1         ATPL_VIBAL      ATP Synthase (3.6.1.34) (Lipid-binding)
          540#325#       BACT_HALVA      Sensory Rhodopsin II (Sr-II)

7.35.4    E1.9.3         COXB_RAT        Cytochrome C Oxidase (1.9.3.1) (Via*)
          345#           DESR_DESBI      Desulforedoxin (Dx)

7.7.1     349#           TAP_ORNMO       Tick Anticoagulant Peptide


(Table continues on following page.) 



dopsin) superfamilies into a component of the respiratory 
chain, cytochrome C oxidase II (COX2_ZOOAN). All these 
examples demonstrate the evolutionary advantage of a do- 
main fusion event, which creates a function that is more com- 
plex than either of the components. 

Multifunctionality vs. Sequence Similarity 

Previously, we presented a variety of graphs that show how 
the probability that two domains share the same function 
varies with sequence similarity (Hegyi and Gerstein 1999; 
Wilson et al. 2000). Figure 4 shows a similar graph with the 
calculations extended to multi-domain proteins. The figure 
shows that the functional divergence of a single domain is 
more than twofold greater in multi-domain proteins than in 
single-domain ones. This reinforces our findings above, based 
only on superfamily content, that the certainty with which we 
can predict the function of a protein based on its sequence 
similarity with a domain in another multi-domain protein is 
considerably less than for a comparable single-domain 
situation. 

Table 5B. Multi-Domain Proteins 



Sfam Comb.     Funct #     SWISS-PROT ID  SWISS-PROT function
1.25.1/7.35.4  104#        RUBY_METJA     Putative Rubrerythrin
1.32.1/3.81.1  11#         PURR_HAEIN     Purine Nucleotide Synthesis Repressor
               12#         DEGA_BACSU     Degradation Activator
               581#11#     SCRR_STRMU     Sucrose Operon Repressor
               582#11#     REGA_CLOAB     Transcription Regulatory Protein Rega
1.4.3/3.14.2   10#         SKN7_YEAST     Transcription Factor Skn7 (Pos9 Protein)
               11#         VIRG_AGRT5     Virg Regulatory Protein
               13#         RGX3_MYCTU     Sensory Transduction Protein REGX3
               190#        PFER_PSEAE     Transcriptional Activator Protein Pfer
               366#        PETR_RHOCA     Petr Protein
2.45.1/7.7.1   203#153#    HC_RAT         Alpha-1-Microglobulin/Trypsin Inhibitor
2.5.1/6.5.1    E1.9.3      COX2_ZOOAN     Cytochrome C Oxidase II (1.9.3.1)
3.29.1/3.48.1  E2.7.1      F26_RANCA      6-Phosphofructo-2-Kinase (2.7.1.105)
3.47.1/5.17.1  1#          YEDO_YEAST     Heat Shock Protein 70 Homolog YEL030w
               1#83#       GR73_MAIZE     Ig-Binding Protein

DISCUSSION 

Here we built on our previous studies on the relationship 
between protein structure and function to develop new re- 
sults related to multi-domain proteins. Throughout the paper, 
we focused on superfamilies instead of folds, as the members 
of a superfamily are presumably of common evolutionary ori- 
gin (Murzin et al. 1995). 

We found that the 4763 multi-domain and 1818 single-domain 
proteins that met our selection criteria have about the same 
distribution of structural classes, with more enzymatic 
functions associated with the alpha/beta structural classes 
and more non-enzymatic ones with the all-alpha and small 
classes. We identified more than three times as many 
multi-domain enzymes as single-domain ones (2805 vs. 850) 
and about twice as many multi-domain non-enzymes as 
single-domain ones (1958 vs. 968). 
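
The ratios quoted above follow directly from the four counts. A minimal Python check, using only the numbers given in the paragraph (the variable names are ours and purely illustrative):

```python
# Counts quoted in the paragraph above (proteins meeting the selection criteria).
counts = {
    "multi-domain": {"enzyme": 2805, "non-enzyme": 1958},
    "single-domain": {"enzyme": 850, "non-enzyme": 968},
}

# Totals per group should reproduce the 4763 and 1818 proteins in the text.
totals = {group: sum(by_type.values()) for group, by_type in counts.items()}

# Multi-domain to single-domain ratios for enzymes and non-enzymes.
enzyme_ratio = counts["multi-domain"]["enzyme"] / counts["single-domain"]["enzyme"]
non_enzyme_ratio = counts["multi-domain"]["non-enzyme"] / counts["single-domain"]["non-enzyme"]

# Share of enzymes within each group.
enzyme_fraction = {g: counts[g]["enzyme"] / totals[g] for g in counts}

print(totals)
print(round(enzyme_ratio, 2), round(non_enzyme_ratio, 2))
print({g: round(f, 2) for g, f in enzyme_fraction.items()})
```

The enzyme share per group is not stated in the text; it is derived here from the same counts and suggests that enzymes make up a somewhat larger share of the multi-domain set.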

We focused on the functional divergence of the two groups 
and found that about a quarter of the superfamilies in 
single-domain proteins are associated with multiple 
functions, whereas only about a fifth of the multi-domain 
superfamily combinations are. We can therefore conclude that 
a combination of specific superfamilies results in a more 
specific functional assignment for a particular protein. 
However, about one-third of the individual superfamilies in 
the multi-domain proteins were associated with multiple 
functions, underlining the lesser autonomy of a domain's 
function in a multi-domain protein. 
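
One way to count "superfamilies associated with multiple functions" is sketched below in Python. The input pairs are copied from a few rows of Table 5A; the function name `multifunctional_fraction` and the rule of treating every distinct function label as a separate function are our assumptions, not a description of the survey's actual procedure. The resulting fraction describes this small excerpt only.

```python
from collections import defaultdict

# (superfamily, function label) pairs copied from a few rows of Table 5A.
assignments = [
    ("1.25.1", "352#"), ("1.25.1", "183#"), ("1.25.1", "E1.17.4"), ("1.25.1", "192#"),
    ("1.4.3", "196#"),
    ("1.81.2", "E2.5.1"),
    ("3.47.1", "275#"), ("3.47.1", "276#"),
    ("3.48.1", "E3.1.3"),
]

def multifunctional_fraction(pairs):
    """Return the fraction of keys with more than one distinct function, and those keys."""
    functions = defaultdict(set)
    for sfam, func in pairs:
        functions[sfam].add(func)
    multi = sorted(s for s, f in functions.items() if len(f) > 1)
    return len(multi) / len(functions), multi

fraction, multi = multifunctional_fraction(assignments)
print(fraction, multi)
```

The same function could be applied to multi-domain proteins by using a superfamily combination such as "1.32.1/3.81.1" as the key instead of a single superfamily.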

This latter finding was also supported by the difference 
in functional divergences between the two groups of proteins 
based on particular sequence similarities between the do- 
mains and SWISS-PROT proteins. As is shown in Figure 4, the 
average functional divergence of a single domain is much 
larger (more than twofold) in multi-domain proteins than in 
single-domain ones. 
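
A curve of the kind plotted in Figure 4 could be computed along the following lines. This is only a sketch under stated assumptions: the match records are invented for illustration, and the rule used here (a domain counts as divergent when the matches retained at a given threshold carry more than one function label) is our reading of the figure caption rather than the documented procedure.

```python
import math
from collections import defaultdict

# Hypothetical matches: (domain id, e-value of the match, function label of the matched protein).
matches = [
    ("d1", 1e-60, "E3.2.1"), ("d1", 1e-10, "E3.2.1"), ("d1", 1e-5, "E2.4.1"),
    ("d2", 1e-30, "11#"), ("d2", 1e-25, "12#"),
    ("d3", 1e-8, "E5.2.1"),
]

def divergence(matches, min_neg_log_e):
    """Fraction of matched domains whose retained matches carry more than one function."""
    functions = defaultdict(set)
    for domain, e_value, function in matches:
        # Keep only matches at least as significant as the threshold.
        if -math.log10(e_value) >= min_neg_log_e:
            functions[domain].add(function)
    if not functions:
        return None  # no domain has a match at this threshold
    return sum(len(f) > 1 for f in functions.values()) / len(functions)

# Thresholds are expressed as -log10(e-value), as on the x-axis of Figure 4.
curve = {t: divergence(matches, t) for t in (4, 20, 40, 80)}
print(curve)
```

Running the same function separately on single-domain matches and on single-domain hits within multi-domain proteins would give the two series that the figure compares.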

We also found that only 70 of a total of 455 superfamilies 
are shared between the multi-domain and single-domain 
proteins, and only a small fraction of these (14) share 
their functions. This was rather surprising to us, and 
should be taken into consideration in the functional 
characterization and annotation of new gene products. When 
the functions were related in single- and multi-domain 
proteins, we could observe an increasing functional 
complexity with the appearance of large multi-domain 
proteins. 
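
For enzymes, Table 4 treats two proteins as sharing a function when the first three components of their EC numbers agree. A minimal Python sketch of that comparison follows; the helper names `ec_class` and `same_function` are hypothetical, and rejecting EC numbers with a dash among the first three components is our assumption. The EC numbers in the examples are taken from Tables 4 and 5.

```python
def ec_class(ec_number, components=3):
    """Return a label such as 'E3.2.1' built from the leading EC components."""
    parts = ec_number.split(".")
    # A dash among the leading components means the class is not fully specified.
    if len(parts) < components or "-" in parts[:components]:
        raise ValueError(f"EC number too incomplete: {ec_number!r}")
    return "E" + ".".join(parts[:components])

def same_function(ec_a, ec_b):
    """True when two EC numbers agree in their first three components."""
    return ec_class(ec_a) == ec_class(ec_b)

# Endoglucanase vs. glucoamylase (superfamily 1.81.1 in Table 4).
print(same_function("3.2.1.4", "3.2.1.3"))
# Neutral protease vs. collagenase 3 (superfamily 4.52.1 in Table 4).
print(same_function("3.4.24.-", "3.4.24.-"))
# Cytochrome C oxidase vs. ATP synthase (Table 5).
print(same_function("1.9.3.1", "3.6.1.34"))
```

Non-enzymes are compared by their keyword combinations (the labels ending in #), which would require the functional thesaurus and are not covered by this sketch.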

Altogether, with the recent sequencing of the human genome 
and the genomes of other model organisms, we hope that this 
work can contribute to the successful annotation of the 
individual gene products, and will help to avoid some 
pitfalls associated with the functional characterization of 
large, complex proteins. 

Figure 4 Divergence in function with respect to sequence 
similarity. Relative number of matching domains with multiple 
functions, as a function of the e-value threshold (x-axis: 
-log(e-value)). Diamonds represent single-domain proteins, 
squares multi-domain ones (matching just for a single 
domain). The first value on the x-axis is 4 (corresponding 
to an e-value of 10^-4). 

The publication costs of this article were defrayed in part 
by payment of page charges. This article must therefore be 
hereby marked "advertisement" in accordance with 18 USC 
section 1734 solely to indicate this fact. 